# 10. Summary

REINFORCE increases the probability of "good" actions and decreases the probability of "bad" actions. ([Source](https://blog.openai.com/evolution-strategies/))

### What are Policy Gradient Methods?

  • Policy-based methods are a class of algorithms that search directly for the optimal policy, without simultaneously maintaining value function estimates.
  • Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.
  • In this lesson, we represent the policy with a neural network, where our goal is to find the weights \theta of the network that maximize expected return.
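
A minimal PyTorch sketch of such a policy network is shown below. The lesson does not prescribe an architecture, so the layer sizes, the default state/action dimensions, and the `act` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """Maps a state to a probability distribution over a small set of discrete actions."""

    def __init__(self, state_size=4, action_size=2, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=-1)  # action probabilities

    def act(self, state):
        """Sample an action and return its log-probability (needed for the gradient estimate)."""
        probs = self.forward(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```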

### The Big Picture

  • The policy gradient method will iteratively amend the policy network weights to:
    • make (state, action) pairs that resulted in positive return more likely, and
    • make (state, action) pairs that resulted in negative return less likely.

### Problem Setup

  • A trajectory \tau is a state-action sequence s_0, a_0, \ldots, s_H, a_H, s_{H+1}.
  • In this lesson, we will use the notation R(\tau) to refer to the return corresponding to trajectory \tau.
  • Our goal is to find the weights \theta of the policy network to maximize the expected return U(\theta) := \sum_\tau \mathbb{P}(\tau;\theta)R(\tau).
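
As a concrete illustration of these definitions, the short sketch below computes R(\tau) from a trajectory's rewards and forms a Monte Carlo estimate of U(\theta) from a handful of sampled trajectories. The helper names and the optional discount factor are assumptions for illustration, not part of the lesson's notation.

```python
import numpy as np


def trajectory_return(rewards, gamma=1.0):
    """R(tau): the (optionally discounted) sum of rewards collected along one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def estimate_expected_return(sampled_reward_lists, gamma=1.0):
    """Monte Carlo estimate of U(theta): average R(tau) over trajectories sampled from pi_theta."""
    return np.mean([trajectory_return(rewards, gamma) for rewards in sampled_reward_lists])


# Two hypothetical trajectories with per-step rewards:
print(estimate_expected_return([[1.0, 1.0, 1.0], [1.0, 0.0]]))  # (3.0 + 1.0) / 2 = 2.0
```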

### REINFORCE

  • The pseudocode for REINFORCE is as follows:
  1. Use the policy \pi_\theta to collect m trajectories \{ \tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(m)} \} with horizon H. We refer to the i-th trajectory as
    \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, s_{H+1}^{(i)}).
  2. Use the trajectories to estimate the gradient \nabla_\theta U(\theta):
    \nabla_\theta U(\theta) \approx \hat{g} := \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})
  3. Update the weights of the policy:
    \theta \leftarrow \theta + \alpha \hat{g}
  4. Loop over steps 1-3.
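
Putting steps 1-3 together, here is a minimal sketch of one REINFORCE iteration in PyTorch. It reuses the `Policy` class sketched earlier, assumes a hypothetical `env` that follows the classic Gym interface (`reset()` returns a state; `step(action)` returns `(state, reward, done, info)`), and treats R(\tau) as the undiscounted sum of rewards; the helper names are illustrative, not from the lesson.

```python
import torch
import torch.optim as optim


def collect_trajectory(env, policy, max_t=1000):
    """Roll out one trajectory with the current policy (step 1)."""
    log_probs, rewards = [], []
    state = env.reset()
    for _ in range(max_t):
        action, log_prob = policy.act(state)
        state, reward, done, _ = env.step(action)
        log_probs.append(log_prob)
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards


def reinforce_update(env, policy, optimizer, m=10):
    """Collect m trajectories, estimate g_hat, and take one gradient-ascent step (steps 1-3)."""
    losses = []
    for _ in range(m):
        log_probs, rewards = collect_trajectory(env, policy)
        R = sum(rewards)  # R(tau): total return of this trajectory
        # Minimizing -R(tau) * sum_t log pi(a_t|s_t) is gradient *ascent* on U(theta).
        losses.append(-R * torch.stack(log_probs).sum())
    loss = torch.stack(losses).mean()  # the 1/m average over trajectories
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Usage (env is a hypothetical Gym-style environment with a discrete action space):
# policy = Policy(state_size=4, action_size=2)
# optimizer = optim.Adam(policy.parameters(), lr=1e-2)
# for _ in range(500):
#     reinforce_update(env, policy, optimizer)
```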

### Derivation

  • We derived the likelihood ratio policy gradient: \nabla_\theta U(\theta) = \sum_\tau \mathbb{P}(\tau;\theta)\nabla_\theta \log \mathbb{P}(\tau;\theta)R(\tau) .
  • We can approximate the gradient above with a sample-based estimate, averaging over m sampled trajectories:
    \nabla_\theta U(\theta) \approx \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta)R(\tau^{(i)}).
  • Because the transition dynamics do not depend on \theta, we calculated the following (see the worked expansion below):
    \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta (a_t^{(i)}|s_t^{(i)}).
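
For completeness, the two results above can be expanded as follows. This is only a sketch: \mu(s_0) denotes the start-state distribution, which, like the transition dynamics, does not depend on \theta.

```latex
\begin{align*}
\nabla_\theta U(\theta)
  &= \sum_\tau \nabla_\theta \mathbb{P}(\tau;\theta)\, R(\tau)
   = \sum_\tau \mathbb{P}(\tau;\theta)\,
     \frac{\nabla_\theta \mathbb{P}(\tau;\theta)}{\mathbb{P}(\tau;\theta)}\, R(\tau)
   = \sum_\tau \mathbb{P}(\tau;\theta)\,
     \nabla_\theta \log \mathbb{P}(\tau;\theta)\, R(\tau), \\
\nabla_\theta \log \mathbb{P}(\tau;\theta)
  &= \nabla_\theta \log \Big[ \mu(s_0) \prod_{t=0}^{H}
       \mathbb{P}(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t) \Big]
   = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t).
\end{align*}
```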

### What's Next?

  • REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
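
The discrete-action sketch earlier uses a softmax over actions; for continuous action spaces a common choice (not prescribed by this lesson) is a Gaussian policy whose mean is produced by the network, with actions sampled from that distribution. A minimal sketch, with illustrative names and sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianPolicy(nn.Module):
    """Policy for a continuous action space: the network outputs the mean of a Gaussian,
    and the log standard deviation is a separate learned parameter."""

    def __init__(self, state_size=3, action_size=1, hidden_size=16):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.mu = nn.Linear(hidden_size, action_size)
        self.log_std = nn.Parameter(torch.zeros(action_size))

    def act(self, state):
        x = F.relu(self.fc1(torch.as_tensor(state, dtype=torch.float32)))
        dist = torch.distributions.Normal(self.mu(x), self.log_std.exp())
        action = dist.sample()
        # Sum over action dimensions so the log-probability is a scalar, as in the discrete case.
        return action.numpy(), dist.log_prob(action).sum()
```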